Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add implementation of MultipartFormData streaming encoder #200

Merged
merged 14 commits into from
Feb 14, 2025

Conversation

andreiltd
Copy link
Contributor

@andreiltd andreiltd commented Jan 21, 2025

I did a quick comparison of encoded FormData between SM and Chrome using following JS code:

Simple FormData
async function readStream(stream) {
 const reader = stream.getReader();
 const chunks = [];

 while (true) {
   const { done, value } = await reader.read();
   if (done) {
     break;
   }
   chunks.push(value);
 }

 let totalLen = 0;
 for (const chunk of chunks) {
   totalLen += chunk.length;
 }

 const result = new Uint8Array(totalLen);

 let offset = 0;
 for (const chunk of chunks) {
   result.set(chunk, offset);
   offset += chunk.length;
 }

 return result;
}

async function main() {
 const form = new FormData();
 form.append('field1', 'value1');
 form.append('field2', 'value2');

 const file = new File(['Hello World!'], 'dummy.txt', { type: 'foo' });
 form.append('file1', file);

 const req = new Request('https:example.com', {
   method: 'POST',
   body: form,
 });

 const type = req.headers.get('Content-Type');
 const boundary = type.split('boundary=')[1];

 // assert(type.startsWith('multipart/form-data; boundary='), "Content-Type is multipart/form-data");
 const data = await readStream(req.body);
 const dec = new TextDecoder();
 const bodyStr = dec.decode(data);

 console.log(bodyStr);
}

main();

The results are:

  • Chrome:
------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="field1"

value1
------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="field2"

value2
------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="file1"; filename="dummy.txt"
Content-Type: foo

Hello World!
------WebKitFormBoundaryhpShnP1JqrBTVTnC--

  • SM:
----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="field1"

value1
----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="field2"

value2
----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="file1"; filename="dummy.txt"
Content-Type: foo

Hello World!
----BoundaryjXo5N4HEAXWcKrw7--

I also did some manual fuzzing to test the streaming logic by varying the buffer sizes the encoder writes into. Specifically, I tested these buffer sizes in bytes: 1, 2, 4, 8, 32, 1024, and 8192 by changing the size in the implementation. Though, having some unit test framework would be nice.

Relevant RFC: https://www.rfc-editor.org/rfc/rfc2046#section-5.1.1

@andreiltd andreiltd marked this pull request as draft January 21, 2025 14:21
@andreiltd andreiltd marked this pull request as ready for review January 22, 2025 14:02
@andreiltd
Copy link
Contributor Author

andreiltd commented Jan 22, 2025

It looks like the CI failure is unrelated to this PR (I've observed the same failure on main branch). This is a problematic test:

    const response = await fetch("https://http-me.glitch.me/meow?header=cat:é");
    strictEqual(response.headers.get('cat'), "é");

@tschneidereit
Copy link
Member

@andreiltd, is this ready to review, or are you still expecting to make changes to it?

@andreiltd
Copy link
Contributor Author

Yes, this is ready to review :)

Copy link
Member

@tschneidereit tschneidereit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either I'm missing something in looking at the diff, or there is some code missing—see the inline comment on form-data-encoder.h.

One thing that I can't fully evaluate, because it'll presumably be part of the implementation of encode_stream: IIUC, field names, will always be first have their newlines normalized to CRLF, and then have those escaped. It'd be good to fold those into one operation, if possible.

Otherwise, I left a few comments and suggestions. I'm a bit concerned about allocation and general failure handling, so it'd be good to go over those aspects in some detail.

builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.h Outdated Show resolved Hide resolved
bool MultipartFormDataImpl::handle_entry_header(JSContext *cx, StreamContext &stream) {
auto entry = stream.entries->begin()[chunk_idx_];
auto header = fmt::memory_buffer();
auto name = escape_newlines(entry.name).value();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a null check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is overloaded but the version that doesn't take JSContetext is infallible. I can change this to separate function like escape_name and escape_name_val if we prefer explicit return value. Anyway, I've added the assertion for now.

builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Show resolved Hide resolved
@andreiltd
Copy link
Contributor Author

andreiltd commented Jan 27, 2025

Hi @tschneidereit, as always, thank you so much for a through review!

One thing that I can't fully evaluate, because it'll presumably be part of the implementation of encode_stream: IIUC, field names, will always be first have their newlines normalized to CRLF, and then have those escaped. It'd be good to fold those into one operation, if possible.

So the spec says this:

For each entry of entry list:

  1. Replace every occurrence of U+000D (CR) not followed by U+000A (LF), and every occurrence of U+000A (LF) not preceded by U+000D (CR), in entry's name, by a string consisting of a U+000D (CR) and U+000A (LF).
  2. If entry's value is not a File object, then replace every occurrence of U+000D (CR) not followed by U+000A (LF), and every occurrence of U+000A (LF) not preceded by U+000D (CR), in entry's value, by a string consisting of a U+000D (CR) and U+000A (LF).

It's really hard to follow the spec closely because the implementation is state-machine-based, and a lot of actions that the spec calls for correspond to different states in the implementation and cannot be done together. For example, processing both names and values: names are part of the entry-header state, whereas values are handled in the entry-body state.

@andreiltd
Copy link
Contributor Author

andreiltd commented Feb 6, 2025

I've adopted a bunch of wpt tests here as an integration test. Those should ensure the we serialize FormData according to the spec.

Unfortunately, I wasn't able to run those in wpt tests because of few issues:

  • it looks like meta script functions are not visible from the test, e.g.: functions from META: script=../support/send-file-formdata-helper.js are missing when running this test
  • the wpt is hosting a bunch of python scripts and make the calls from within the tests, like this one:
    const formDataText = await (await fetch(
      `/fetch/api/resources/echo-content.py`,
      {
        method: "POST",
        body: formData,
      },
    )).text();

This returns an empty string.

@tschneidereit
Copy link
Member

I've adopted here a bunch of wpt tests here as an integration test. Those should ensure the we serialize FormData according to the spec.

That's a great idea!

Unfortunately, I wasn't able to run those in wpt tests because of few issues:

Can you say more about how you're trying to run the test? I can't see where in the PR the META: script line appears, or where the fetch to the python scripts is happening.

Taking a step back though, to fully adapt a test from WPT requires at least these steps:

  1. Remove all META: script lines with inlined (or imported versions) of that code, because these are only interpreted in the special WPT harness I wrote
  2. Remove all fetch requests to python scripts, because that server is only up while the WPT tests are running. The best way to do this is probably to change the fetch event handler to check for the url, e.g. /fetch/api/resources/echo-content.py from your example, and implement the logic of the script in JS. (Which in this case I guess means simply returning the incoming body as the outgoing one.)
  3. Ensuring that we retain the proper license. WPT code is licensed under 3-clause BSD, and we need to keep that. Probably the best way to do this is to have all tests derived from WPT in a separate folder, and add the WPT LICENSE.md file to that folder, with line 3 changed to Copyright © web-platform-tests contributors, StarlingMonkey contributors

@andreiltd
Copy link
Contributor Author

andreiltd commented Feb 6, 2025

Can you say more about how you're trying to run the test?

Oh, I was just trying to run wpt tests directly by whitelisting them in the tests.json file. Those are the tests: https://github.com/web-platform-tests/wpt/tree/master/FileAPI/file

The failures that I described were from running wpt-tests so I ended up rewriting them, exactly as yo said, and include in our integration tests.

Ensuring that we retain the proper license.

That's a good point. Would tests/integration/wpt be a good directory for maintaining the wpt derived tests?

@tschneidereit
Copy link
Member

That's a good point. Would tests/integration/wpt be a good directory for maintaining the wpt derived tests?

It would be, yes. Though if you're only doing this to work around weird issues with running WPT tests, then I would overall prefer figuring out what causes them. Can you give me steps to reproduce for the issue you were seeing with this test?

@andreiltd
Copy link
Contributor Author

I added the test like this:

diff --git a/tests/wpt-harness/tests.json b/tests/wpt-harness/tests.json
index 349baa1..fb757b9 100644
--- a/tests/wpt-harness/tests.json
+++ b/tests/wpt-harness/tests.json
@@ -175,6 +175,7 @@
   "FileAPI/blob/Blob-stream.any.js",
   "FileAPI/blob/Blob-text.any.js",
   "FileAPI/file/File-constructor.any.js",
+  "FileAPI/file/send-file-formdata.any.js",
   "hr-time/basic.any.js",
   "hr-time/idlharness.any.js",
   "hr-time/monotonic-clock.any.js",

and then running

just wpt-test FileAPI/file/send-file-formdata.any.js

gives me this error:

1: Running test FileAPI/file/send-file-formdata.any.js
1: Sending request to http://127.0.0.1:7676/FileAPI/file/send-file-formdata.any.js
1: wasmtime stderr: stderr [0] :: Warning: Using a DEBUG build. Expect things to be SLOW.
1: wasmtime stdout: stdout [0] :: Log: running test /FileAPI/file/send-file-formdata.any.js
1: wasmtime stdout: stdout [0] :: Log: error: ReferenceError: formDataPostFileUploadTest is not defined, stack:
1: stdout [0] :: @/FileAPI/file/send-file-formdata.any.js:5:3
1: stdout [0] :: [email protected]:5043:5
1: stdout [0] :: [email protected]:5024:19
1: stdout [0] :: async*@wpt-test-runner.js:5073:54
1: stdout [0] :: 

@andreiltd
Copy link
Contributor Author

andreiltd commented Feb 6, 2025

Just looking at how we import meta scripts here it looks like the meta scripts are evaluated separately from the test, shouldn't the meta and the test be concatenated instead?

@tschneidereit
Copy link
Member

Just looking at how we import meta scripts here it looks like the meta scripts are evaluated separately from the test, shouldn't the meta and the test be concatenated instead?

I don't entirely remember why I landed on this exact way of loading META scripts. I do remember that it was non-trivial to get things working, and included this weird thing where we replace let and const with var bindings. If instead we could just concatenate things, that'd be nice. It's possible that the key reason was to ensure that we run the test itself without changing the lines numbers: if we concatenate everything, we can't do that anymore. But as I said, I don't entirely remember

@andreiltd
Copy link
Contributor Author

Yeah, it would be really nice to have those tests working, it's worth dozens of wpt passes. Let me know if you plan to investigate otherwise I will allocate some time to it for next week. 🙂

@andreiltd
Copy link
Contributor Author

Merging #209 Should fix failing wpt tests in this PR.

@tschneidereit
Copy link
Member

This is looking quite excellent! 🥳 I think it's very close to being ready to land—however, before doing another in-depth review, it'd be great if you could go through the code and add a lot more links to the spec algorithms implemented. Additionally, adding comments explaining non-obvious design choices would be good, too.

In particular, explaining the boundary string code with a comment covering what you explained in the initial description in this PR would be good.

(It's very much okay to have these comments be long: we'll be grateful to have detailed explanations of all this stuff in the future!)

Copy link
Member

@tschneidereit tschneidereit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic work!

I left a last round of feedback, but none of that should require another look after you've addressed it, so I'm happy to merge once you let me know it's ready!

builtins/web/fetch/request-response.cpp Show resolved Hide resolved

auto boundary = MultipartFormData::boundary(encoder);
auto type = "multipart/form-data; boundary=" + boundary;
host_type_str = type.c_str();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do host_type_str = type here, as HostString has a constructor taking a string_view. That at least saves one copy: .c_str does an internal copy to append the trailing \0. It's a bit unfortunate that we have to do another copy here, but this is all expensive enough that it probably doesn't matter at all. (Plus, LLVM might actually eliminate it.)

Copy link
Contributor Author

@andreiltd andreiltd Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to cleanup this code a bit. I think the problem here is that at the top of this function the content_type is defined as const char *content_type and then when we pass it to the Header::set_valid_if_undefined it is implicitly converted into a string_view. This conversion is UB if there is no trailing \0.

I changed this to use a string_view everywhere instead. This also eliminates copying.

tests/integration/fetch/fetch.js Outdated Show resolved Hide resolved
tests/integration/formdata/formdata.js Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
@tschneidereit
Copy link
Member

Oh, and I forgot to mention: the comments you added are really excellent and made reviewing this much easier. Thank you!

@andreiltd
Copy link
Contributor Author

Hey @tschneidereit , thanks for review! I I've addressed all the issues from the last round except for the constructor suggestion. See my comment: #200 (comment)

Copy link
Member

@tschneidereit tschneidereit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

@tschneidereit tschneidereit merged commit 856e923 into bytecodealliance:main Feb 14, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants